990 research outputs found

    Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

    Full text link
    In this paper we show that reporting a single performance score is insufficient to compare non-deterministic approaches. We demonstrate for common sequence tagging tasks that the seed value for the random number generator can result in statistically significant (p < 10^-4) differences for state-of-the-art systems. For two recent systems for NER, we observe an absolute difference of one percentage point F1-score depending on the selected seed value, making these systems perceived either as state-of-the-art or mediocre. Instead of publishing and reporting single performance scores, we propose to compare score distributions based on multiple executions. Based on the evaluation of 50.000 LSTM-networks for five sequence tagging tasks, we present network architectures that produce both superior performance as well as are more stable with respect to the remaining hyperparameters.Comment: Accepted at EMNLP 201

    Alternative Weighting Schemes for ELMo Embeddings

    Full text link
    ELMo embeddings (Peters et. al, 2018) had a huge impact on the NLP community and may recent publications use these embeddings to boost the performance for downstream NLP tasks. However, integration of ELMo embeddings in existent NLP architectures is not straightforward. In contrast to traditional word embeddings, like GloVe or word2vec embeddings, the bi-directional language model of ELMo produces three 1024 dimensional vectors per token in a sentence. Peters et al. proposed to learn a task-specific weighting of these three vectors for downstream tasks. However, this proposed weighting scheme is not feasible for certain tasks, and, as we will show, it does not necessarily yield optimal performance. We evaluate different methods that combine the three vectors from the language model in order to achieve the best possible performance in downstream NLP tasks. We notice that the third layer of the published language model often decreases the performance. By learning a weighted average of only the first two layers, we are able to improve the performance for many datasets. Due to the reduced complexity of the language model, we have a training speed-up of 19-44% for the downstream task

    The precision of line position measurements of unresolved quasar absorption lines and its influence on the search for variations of fundamental constants

    Full text link
    Optical quasar spectra can be used to trace variations of the fine-structure constant alpha. Controversial results that have been published in last years suggest that in addition to to wavelength calibration problems systematic errors might arise because of insufficient spectral resolution. The aim of this work is to estimate the impact of incorrect line decompositions in fitting procedures due to asymmetric line profiles. Methods are developed to distinguish between different sources of line position shifts and thus to minimize error sources in future work. To simulate asymmetric line profiles, two different methods were used. At first the profile was created as an unresolved blend of narrow lines and then, the profile was created using a macroscopic velocity field of the absorbing medium. The simulated spectra were analysed with standard methods to search for apparent shifts of line positions that would mimic a variation of fundamental constants. Differences between position shifts due to an incorrect line decomposition and a real variation of constants were probed using methods that have been newly developed or adapted for this kind of analysis. The results were then applied to real data. Apparent relative velocity shifts of several hundred meters per second are found in the analysis of simulated spectra with asymmetric line profiles. It was found that each system has to be analysed in detail to distinguish between different sources of line position shifts. A set of 16 FeII systems in seven quasar spectra was analysed. With the methods developed, the mean alpha variation that appeared in these systems was reduced from the original Dalpha/alpha=(2.1+/-2.0)x10^-5 to Dalpha/alpha=(0.1+/-0.8)x10^-5. We thus conclude that incorrect line decompositions can be partly responsible for the conflicting results published so far

    Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks

    Full text link
    Selecting optimal parameters for a neural network architecture can often make the difference between mediocre and state-of-the-art performance. However, little is published which parameters and design choices should be evaluated or selected making the correct hyperparameter optimization often a "black art that requires expert experiences" (Snoek et al., 2012). In this paper, we evaluate the importance of different network design choices and hyperparameters for five common linguistic sequence tagging tasks (POS, Chunking, NER, Entity Recognition, and Event Detection). We evaluated over 50.000 different setups and found, that some parameters, like the pre-trained word embeddings or the last layer of the network, have a large impact on the performance, while other parameters, for example the number of LSTM layers or the number of recurrent units, are of minor importance. We give a recommendation on a configuration that performs well among different tasks.Comment: 34 pages. 9 page version of this paper published at EMNLP 201

    Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches

    Full text link
    Developing state-of-the-art approaches for specific tasks is a major driving force in our research community. Depending on the prestige of the task, publishing it can come along with a lot of visibility. The question arises how reliable are our evaluation methodologies to compare approaches? One common methodology to identify the state-of-the-art is to partition data into a train, a development and a test set. Researchers can train and tune their approach on some part of the dataset and then select the model that worked best on the development set for a final evaluation on unseen test data. Test scores from different approaches are compared, and performance differences are tested for statistical significance. In this publication, we show that there is a high risk that a statistical significance in this type of evaluation is not due to a superior learning approach. Instead, there is a high risk that the difference is due to chance. For example for the CoNLL 2003 NER dataset we observed in up to 26% of the cases type I errors (false positives) with a threshold of p < 0.05, i.e., falsely concluding a statistically significant difference between two identical approaches. We prove that this evaluation setup is unsuitable to compare learning approaches. We formalize alternative evaluation setups based on score distributions

    The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes

    Full text link
    Information Retrieval using dense low-dimensional representations recently became popular and showed out-performance to traditional sparse-representations like BM25. However, no previous work investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance for dense representations decreases quicker than sparse representations for increasing index sizes. In extreme cases, this can even lead to a tipping point where at a certain index size sparse representations outperform dense representations. We show that this behavior is tightly connected to the number of dimensions of the representations: The lower the dimension, the higher the chance for false positives, i.e. returning irrelevant documents.Comment: Published at ACL 202

    Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora

    Full text link
    Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents. CDCR aims to benefit downstream multi-document applications, but despite recent progress on corpora and system development, downstream improvements from applying CDCR have not been shown yet. We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus. This raises strong concerns on their generalizability -- a must-have for downstream applications where the magnitude of domains or event mentions is likely to exceed those found in a curated corpus. To investigate this assumption, we define a uniform evaluation setup involving three CDCR corpora: ECB+, the Gun Violence Corpus and the Football Coreference Corpus (which we reannotate on token level to make our analysis possible). We compare a corpus-independent, feature-based system against a recent neural system developed for ECB+. Whilst being inferior in absolute numbers, the feature-based system shows more consistent performance across all corpora whereas the neural system is hit-and-miss. Via model introspection, we find that the importance of event actions, event time, etc. for resolving coreference in practice varies greatly between the corpora. Additional analysis shows that several systems overfit on the structure of the ECB+ corpus. We conclude with recommendations on how to achieve generally applicable CDCR systems in the future -- the most important being that evaluation on multiple CDCR corpora is strongly necessary. To facilitate future research, we release our dataset, annotation guidelines, and system implementation to the public.Comment: Accepted at CL Journa

    Key Recovery Attack on QuiSci

    Get PDF
    This paper shows a key recovery attack on QuiSci (quick stream cipher), designed by Stefan Müller (FGAN-FHR, a German research institute) in 2001. With one or few know plaintexts it\u27s possible to recover most of the key with negligible time complexity. This paper shows a way how to exploit the weak key setup of QuiSci

    Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

    Full text link
    There are two approaches for pairwise sentence scoring: Cross-encoders, which perform full-attention over the input pair, and Bi-encoders, which map each input independently to a dense vector space. While cross-encoders often achieve higher performance, they are too slow for many practical use cases. Bi-encoders, on the other hand, require substantial training data and fine-tuning over the target task to achieve competitive performance. We present a simple yet efficient data augmentation strategy called Augmented SBERT, where we use the cross-encoder to label a larger set of input pairs to augment the training data for the bi-encoder. We show that, in this process, selecting the sentence pairs is non-trivial and crucial for the success of the method. We evaluate our approach on multiple tasks (in-domain) as well as on a domain adaptation task. Augmented SBERT achieves an improvement of up to 6 points for in-domain and of up to 37 points for domain adaptation tasks compared to the original bi-encoder performance.Comment: Accepted at NAACL 202

    GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

    Full text link
    Dense retrieval approaches can overcome the lexical gap and lead to significantly improved search results. However, they require large amounts of training data which is not available for most domains. As shown in previous work (Thakur et al., 2021b), the performance of dense retrievers severely degrades under a domain shift. This limits the usage of dense retrieval approaches to only a few domains with large training datasets. In this paper, we propose the novel unsupervised domain adaptation method Generative Pseudo Labeling (GPL), which combines a query generator with pseudo labeling from a cross-encoder. On six representative domain-specialized datasets, we find the proposed GPL can outperform an out-of-the-box state-of-the-art dense retrieval approach by up to 9.3 points nDCG@10. GPL requires less (unlabeled) data from the target domain and is more robust in its training than previous methods. We further investigate the role of six recent pre-training methods in the scenario of domain adaptation for retrieval tasks, where only three could yield improved results. The best approach, TSDAE (Wang et al., 2021) can be combined with GPL, yielding another average improvement of 1.4 points nDCG@10 across the six tasks. The code and the models are available at https://github.com/UKPLab/gpl.Comment: Accepted at NAACL 202
    • …
    corecore